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Abstract 

We consider the sparse inverse covariance regularization problem or graphi- 
cal lasso with regularization parameter A. Suppose the sample covariance graph 
formed by thresholding the entries of the sample covariance matrix at A is decom- 
posed into connected components. We show that the vertex-partition induced by 
the connected components of the thresholded sample covariance graph is exactly 
equal to that induced by the connected components of the estimated concentra- 
tion graph, obtained by solving the graphical lasso problem. This characterizes 
a very interesting property of a path of graphical lasso solutions. Furthermore, 
this simple rule, when used as a wrapper around existing algorithms for the 
graphical lasso, leads to enormous performance gains. For a range of values 
of A, our proposal splits a large graphical lasso problem into smaller tractable 
problems, making it possible to solve an otherwise infeasible large-scale problem. 
We illustrate the graceful scalability of our proposal via synthetic and real-life 
microarray examples. 



1 Introduction 

Consider a data matrix X nxp comprising of n sample realizations from a p dimensional 
Gaussian distribution with zero mean and positive definite covariance matrix S (un- 
known), ie Xi : ~ d MVN(0, S). The task is to estimate the unknown X based on the 
n samples. t\ regularized Sparse Inverse Covariance Selection also known as graphical 
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lasso 



Friedman et al 



20071 . iBanerjee et al. 



2008 



Yuan and Linl . 120071 ] estimates the 



covariance matrix S, under the assumption that the inverse covariance matrix i.e. 
S _1 is sparse. This is achieved by minimizing the regularized negative log-likelihood 
function: 



minimize — logdet(0) + tr(S0) + A |0 

j 



'%3 |) 



where S is the sa mple covariance matrix. Problem (pQ) is a convex optimization problem 



(A) 



Boyd and Vandenberghel . 120041 ] . Let denote the solution to (TJ_ 



in the variable 

We note that (pQ) can also be used in a more non-parametric fashion for any positive 
semidefmite input matrix S, not necessarily a sample covariance matrix of a MVN 
sample as described above. 

A related criterion to (pQ) is one where the diagonals are not penalized — we will 
study (pQ) in this paper. 

Developing efficient large-scale algorithms for ([1]) is an active area of research across 
the fields of Convex Optimization, Machine Learning and Statistics. Many algori t hms 
have been proposed for this task 



2009 



been propc 
201ol . lschei 



20101 . iScheinberg et al. 



2010. 



Friedman et al 



Sra and Kim 



2007, 



2011 



Baneriee et all 12 008. 



Yuan 



Li and Tohl . 



2010. 



for example]. However, it appears that certain special properties of the solution to 
(PQ) have been largely ignored. This paper is about one such (surprising) property 
- namely establishing an equivalence between the vertex-partition induced by the 
connected components of the non-zero patterns of 0^ ^ and the thresholded sample 
covariance matrix S. This paper is not about a specific algorithm for the problem 
(PQ) — it focuses on the aforementioned observation that leads to a novel threshold- 
ing/screening procedure based on S. This provides interesting insight into the path of 
solutions {0 ^}a>o obtained by solving (pQ), over a path of A values. The behavior of 
the connected-components obtained from the non-zero patterns of {0 ^}a>o can be 
completely understood by simple screening rules on S. This can be done without even 
attempting to solve ([1]) — arguably a very challenging convex optimization problem. 
Furthermore, this thresholding rule can be used as a wrapper to enormously boost the 
performance of existing algorithms, as seen in our experiments. This strategy becomes 
extremely effective in solving large problems over a range of values of A — sufficiently 
restricted to ensure sparsity and the separation into connected components. 

At this point we introduce some notation and terminology, which we will use 
throughout the paper. 
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1.1 Notations and preliminaries 

For a matrix Z, its (i, j) th entry is denoted by Zy. 

We also introduce some graph theory notations and definitions 
for example]. A finite undirected graph Q on p vertices is given by the ordered tuple 
Q = (V, £), where V is the set of nodes and £ the collection of (undirected) edges. 
The edge-set is equivalently represented via a (symmetric) 0-1 matrijQ (also known 
as the adjacency matrix) with p rows/columns. We use the convention that a node is 
not connected to itself, so the diagonals of the adjacency matrix are all zeros. Let |V| 
and \£\ denote the number of nodes and edges respectively. 

We say two nodes u, v G V are connected if there is a path between them. A 
maximal connected stzfrgrop/Jfl is a connected component of the graph Q. Connected- 
ness is an equivalence relation that decomposes a graph Q into its connected compo- 
nents {(Vi, £e)}i<£<K — with Q = Uf =1 (Ve, £e), where K denotes the number of con- 
nected components. This decomposition partitions the vertices V of Q into {Ve}i<e<K- 
Note that the labeling of the components is unique upto permutations on {1, . . . , K}. 
Throughout this paper we will often refer to this partition as the vertex-partition in- 
duced by the components of the graph Q. If the size of a component is one i.e. \V(\ = 1, 
we say that the node is isolated. Suppose a graph Q defined on the set of vertices V 
admits the following decomposition into connected components: Q = \jf =l iy^ £g). We 
say the vertex-partitions induced by the connected components of Q and Q are equal 
if K = K and there is a permutation ir on {1, . . . , K} such that V 7r (^) = Ve for all 
£e{l,...,K}. 

The paper is organized as follows. Section [2] describes the covariance graph thresh- 
olding idea, along with theoretical justifications and related work, followed by com- 
plexity analysis of the algorithmic framework in Section [3j Numerical experiments 
appear in Section HI concluding remarks in Section [5] and the proofs are gathered in 
the Appendix lAl 



2 denotes absence of an edge and 1 denotes its presence. 
3 G' = (V, £') is a subgraph of Q if V C V and £ ' c £. 
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2 Methodology: Exact Thresholding of the Covari- 
ance Graph 

The sparsity pattern of the solution to (Tj[|) gives rise to the symmetric edge 
matrix/skeleton G {0, l} pxp defined by: 

£{X ) = f 1 if 0^ ^ 0, i^ j; ^ 
1 otherwise. 



The above defines a symmetric graph G^ x \ = (V,£^ x ] ) : nam ely the estimated 



con- 



centration graph Cox and Wermuthl . Il996l . iLauritzenl . Il996l | defined on the nodes 



V={l,...,p} with edges £ {x) . 

Suppose the graph admits a decomposition into k(A) connected components: 

g W = u*W^(A) (3) 

where = (Vg,£f^) are the components of the graph Q^ x \ Note that «(A) G 
{1, . . . ,p}, with k(X) = p (large A) implying that all nodes are isolated and for small 
enough values of A, there is only one component i.e. k(X) = 1. 

We now describe the simple screening/thresholding rule. Given A we perform a 
thresholding on the entries of the sample covariance matrix S and obtain a graph edge 
skeleton E {A) e {0, l} pxp defined by: 

E (A) = { 1 if|Sd>W,; (4) 
I otherwise. 

The symmetric matrix defines a symmetric graph on the nodes V = {1, ■ ■ ■ ,p} 
given by G^ = (V, E^). We refer to this as the thresholded sample covariance graph. 
Similar to the decomposition in ([3]), the graph also admits a decomposition into 
connected components: 

Q (A) = uJW G W, (5) 

where = (Vj , E^') are the components of the graph G^ x \ 

Note that the components of Q ( ' require knowledge of - the solution to (EL]). 
Construction of G^ and its components require operating on S — an operation that 
can be performed completely independent of the optimization problem ([T]), which is 
arguably more expensive (See Section [3]). The surprising message we describe in this 
paper is that the vertex-partition of the connected components of (JHJ) is exactly equal 
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to that of ©. 

This observation has the following consequences: 

1. We obtain a very interesting property of the path of solutions {©*■ ^}a>o — the 
behavior of the connected components of the estimated concentration graph can 
be completely understood by simple screening rules on S ! 

2. The cost of computing the connected components of the thresholded sample co- 
variance graph fl5]) is orders of magnitude smaller than the cost of fitting graphical 
models (pQ). Furthermore, the computations pertaining to the covariance graph 
can be done off-line and is amenable to parallel computation (See Section [3]). 

3. The optimization problem ([T]) completely separates into k(X) separate optimiza- 
tion sub-problems of the form (TjQ). The sub-problems have size equal to the 
number of nodes in each component pi := |Vj|, % — 1, . . . , k(X). Hence for certain 
values of A, solving problem ([T]), becomes feasible although it may be impossible 
to operate on the p x p dimensional (global) variable © on a single machine. 

4. Suppose that for Ao, there are k(\ ) components and the graphical model compu- 
tations are distributee^ . Since the vertex-partitions induced via fl3]) and ()5]) are 
nested with increasing A (see Theorem [2]), it suffices to operate independently on 
these separate machines to obtain the path of solutions j©*" }\ for all A > Ao- 

5. Consider a distributed computing architecture, where every machine allows oper- 
ating on a graphical lasso problem fll]) of maximal size p max - Then with relatively 
small effort we can find the smallest value of A = A Pmax , such that there are no 
connected components of size larger than p max - Problem ([1]) thus 'splits up' in- 
dependently into manageable problems across the different machines. When this 
structure is not exploited the global problem (TjQ) remains intractable. 

The following theorem establishes the main technical contribution of this paper — 
the equivalence of the vertex-partitions induced by the connected components of the 
thresholded sample covariance graph and the estimated concentration graph. 

Theorem 1. For any A > 0, the components of the estimated concentration graph 
G^ x \ as defined in (EJ] and (Gjj induce exactly the same vertex-partition as that of the 



distributing these operations depend upon the number of processors available, their capacities, 
communication lag, the number of components and the maximal size of the blocks across all machines. 
These of-course depend upon the computing environment. In the context of the present problem, it 
is often desirable to club smaller components into a single machine. 
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thresholded sample covariance graph , defined in Op and |3J). That is k(A) = k(X) 
and there exists a permutation it on {1, ... , k(X)} such that: 



^ A) = V^, Vi = l,...,*(A). (6) 

Proof. The proof of the theorem appears in Appendix IA.ll □ 

Since the decomposition of a symmetric graph into its connected components de- 
pends upon the ordering/ labeling of the components, the permutation n appears in 
Theorem [TJ 

Remark 1. Note that the edge- structures within each block need not be preserved. 

Under a matching reordering of the labels of the components of and : 

for every fixed £ such that = vj the edge-sets £^ and are not necessarily 

equal. 

Theorem [1] leads to a special property of the path-of-solutions to (CEJ), i.e. the vertex- 
partition induced by the connected components of are nested with increasing A. 
This is the content of the following theorem. 

Theorem 2. Consider two values of the regularization parameter such that A > A' > 
0, with corresponding concentration graphs and Q^'^ as in (TJ|) and connected 
components Then the vertex-partition induced by the components of are 

nested within the partition induced by the components ofQ^ x '\ Formally, k(X) > k(X') 
and the vertex-partition {V^}i<£< K (x) forms a finer resolution of {v| A ^}i<e<n{X')- 

Proof. The proof of this theorem appears in the Appendix IA.21 □ 

Remark 2. It is worth noting that Theorem^ addresses the nesting of the edges across 
connected components and not within a component. In general, the edge-set £^ of 
the estimated concentration graph need not be nested as a function of X: 
for X > X', in general, £ {x) <£ £ {x>) . 



Sec 



Friedman et al. 



20071 . Figure 3], for numerical examples demonstrating the 



non-monotonicity of the edge- set across A, as described in Remark [21 
2.1 Related Work 



Witten and Friedman 



2011| fairly recently proposed a scheme to detect isolated nodes 



for problem ([T]) via a simp le screening of the entries of S. Using the notation in 

20 111 . Algorithm 1] , the authors propose operating criterion ([!]) 



Witten and Friedman 
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on the set of non-isolated nodes (obtained from the sample covariance matrix) i.e. 
{1, . . . , p} \ C, where the isolated nodes are given by C: 



C = {i: \Sij\ < A,Vj ^ i}. 



(7) 



The authors showed that C is exactly equivalent to the set of isolated nodes of the 
estimated precisio n matrix obtained by solving (JT|) on the entire p x p dimensional 
problem. Earlier, iBanerjee et al.l 2008| [Theorem 4] also made the same observation. 
This is of-course related to a very special case of the proposal in this paper. Suppose in 
Theorem [TJ the estimated concentration gra ph admits a decompos i tion w here some of 
the connected components have size one 



Witten and Friedman 



201lj only screens 



the isolated nodes and treats the the remaini ng nodes as a separate ' conne cted unit'. 
Although the original version of their paper Witten and Friedman! . 1201 lj deal with 



Witten 



single nodes, we have learned 
discovered a form of block screening. 



pa] 



12-2011] that with N. Simon they have also 



This node-screening strategy 
ical lasso) GLASSO algorithm of 



7)) was used by them a s a wrapper around the (graph- 



Friedman et al. 



2007| - leading t o substantial im- 



Friedman et al 



2007J (CRAN GLASSO 



provements over the existing GLASSO solver of 
package version 1.4). However, we show below that the node- screening idea is actually 
an immediate consequence of the block coordinate-wise updates used by the GLASSO 
algorithm — an observation that was no t exploited by the solv er. 

Recall that the GLASSO algorithm Friedman et al.l . 2007 1 operates in a block- 
coordinate-wise i.e. row/column fashion on the variable W = 0~ . The method 
partitions the problem variables as follows: 







011 012 

$21 #22 




w 



Wn W12 

W 2 1 W 2 2 



where the last row/ column represents the optimization variable, the others being 
fixed. The partial optimization problem w.r.t. the last row/column (leaving apart the 
diagonal entry) is given by: 



9 



12 



arg mm 

012 



-0 12 Wii0i 2 + #i ,#22812 + A#22||0 



12 t7 22»12 



'12||1 



(9) 



Clearly the solution #i 2 of the above ([9]) is zero iff 



l S 12 



< A 



(10) 
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- a condition depending only on that row/column of S. As we pointed out before, 
th e above condition (TTOt) is ex actly the condition for node-screening (J7|) described 



in 



Witten and Friedmar 



Witten and Friedman 



solver of 



201 1 |. The notable improvement in timings observed in 



20111] w ith node screening goes on to suggest that the glasso 



Friedman et al 



(as implemented in CRAN GLASSO package Version 
1.4) does not make the check fflOl) . before going on to solve problem (Q. The existing 
implementation goes on to optimize t^fj - - a t\ regularized quadratic program via 
cyclical coordinate-descent. Note that OH]), in its own right, is fairly challenging to 
solve for large problems. 



3 Computational Complexity 



The overall complexity of our proposal depends upon (a) the graph partition stage and 
(b) solving (sub)problems of the form (CQ). In addition to these, there is an unavoidable 
complexity associated with handling and/or forming S. 

The cost of computing the connected components of the thresholded covariance 
graph is fairly negligible when compared to solving a similar sized graphical lasso 
problem — see also our simulation studies in Section [U In case we observe samples 
Xi G 9ft p , % = 1, . . . , n the cost for creating the sample covariance matrix S is 0(n ■ p 2 ). 
Thresholding the sample covariance matrix costs 0(p 2 ). Obtaining the connected 
components of the thresholded covariance graph costs 0(|E (A) | + p) [Tarianl . Il972j |. 
Since we are interested in a region where the thresholded covariance graph is sparse 
enough to be broken into smaller connected components - |E^| <C p 2 . Note that 
all computations pertaining to the construction of the connected components and the 
task of compu ting S can 



parallelizable. 



Gazit 



1991 



De computed off-line. Furthermore the computations are 
for example] describes parallel algorithms for computing 
connected components of a graph — they have a time complexity 0(logp) and require 
0((|E (A) | +p)/log(p)) processors with space 0(p+ |E (A) |). 

There are a wide variety of algorithms for the task of solving (TjQ). While an exhaus- 
tive review of the computational complexities of the different algorithms is beyond the 
sco pe of this paper, we p rovide a brief summary for a few algorithms below. 



Banerjee et al.l [20Q8j ] proposed a smooth accelerated gradient based method 



Nesterov, 



4.5 

20051 ] with complexity 0(- — ) to obtain an e accurate solution — the per iteration cost 
being 0(p 3 ). They also proposed a block coordinate method which has a complexity 
of 0{p 4 ). 



The complexity of the GLASSO algorithm Friedman et al.l . 120071 ] which uses a row 
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by- row block coordinate method is roughly 0(p 3 ) for reasonably sparse-problems with 
p nodes. For denser problems the cos t can be as large as 0(p A ). 



The algorithm smacs proposed in|Lu| 20101 ] has a per iteration complexity of 0(p 3 ) 



and an overall complexity of O(^) to obtain an e > accurate solution. 

It appears that most existing algorithms for flTJ, have a complexity of at least 
0(p 3 ) to 0(p 4 ) or possibly larger, depending upon the algorithm used and the desired 
accuracy of the solution — making computations for (pQ) almost impractical for values 
of p much larger than 2000. 

It is quite clear that the role played by covariance thresholding is indeed crucial 
in this context. Assume that we choose to use a solver of complexity 0(p J ), with 
J G {3, 4}, along with our screening procedure. Suppose for a given A the thresholded 
sample covariance graph has k(X) components — the total cost of solving these smaller 
problems is then Yl^i 0(\Vi \ J ) ^ 0(p J ), with J G {3, 4}. This difference in practice 
can be enormous — see Section [4] for numerical examples. This is what makes large 
scale graphical lasso problems solvable ! 



4 Numerical examples 

In this section we show via numerical experiments that the screening property helps in 
obtaining many fold speed-ups when compared to an algorithm that does not exploit it. 
Section 14.11 considers synthetic examples and Section 14.21 discusses real-life microarray 
data-examples. 



4.1 Synthetic examples 



Experiments are performed with two publicly available algorithm implementations for 
the problem (JTJ): 



GLASSO: The algorithm of 



Friedman et al 



20071 . We used the MATLAB wrap- 



per available at http://www-stat.stanford.edu/~tibs/glasso/index.html 



to the Fortran code. The specific criterion for convergence (lack of progress of 
the diagonal entries) was set to 10 -5 and the maximal number of iterations was 
set to 1000. 



SMACS: denotes the algorithm of |Lu| 20101 ] . We used the MATLAB implementation 



smooth_covsel available at http : //people .math. sfu. ca/~zhaosong/Codes/SM00TH_C0VSEL/ 

The criterion for convergence (based on duality gap) was set to 10~ 5 and the 
maximal number of iterations was set to 1000. 
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We will like to note that the convergence criteria of the two algorithms GLASSO and 
SMACS are not the same. For obtaining the connected components of a symmetric 
adjacency matrix we used the MATLAB function graphconncomp. All of our compu- 
tations are done in MATLAB 7.11.0 on a 3.3 GhZ Intel Xeon processor. 

The simulation examples are created as follows. We generated a block diagonal 
matrix given by S = blkdiag(Si, . . . , Sk), where each block = l Pe xp e — a matrix 
of all ones and J2eP? = P- m the examples we took all p/s to be equal to p\ (say). 
Noise of the form a ■ UU 1 (U is a p x p matrix with i.i.d. standard Gaussian entries) 
is added to S such that 1.25 times the largest (in absolute value) off block-diagonal 
(as in the block structure of S) entry of a ■ UU' equals the smallest absolute non-zero 
entry in S i.e. one. The sample covariance matrix is S = S + a ■ UU' . 

We consider a number of examples for varying K and p± values, as shown in Table 
[TJ Sizes were chosen such that it is at-least 'conceivable' to solve ([I]) on the full 
dimensional problem, without screening. In all the examples shown in Table [U we set 
A/ := (A max + A m ; n )/2, where for all values of A in the interval [A m ; n , A max ] the thresh- 
holded version of the sample covariance matrix has exactly K connected components. 
We also took a larger value of A i.e. \n := A max , which gave sparser estimates of the 
precision matrix but the number of connected components were the same. 

The computations across different connected blocks could be distributed into as 
many machines. This would lead to almost a K fold improvement in timings, however 
in Table [D we report the timings by operating serially across the blocks. The serial 
'loop' across the different blocks are implemented in MATLAB. 

Table [1] shows the rather remarkable improvements obtained by using our proposed 
covariance thresholding strategy as compared to operating on the whole matrix. Tim- 
ing comparisons between GLASSO and SMACS are not fair, since GLASSO is written in 
Fortran and SMACS in MATLAB. However, we note that our experiments are meant 
to demonstrate how the thresholding helps in improving the overall computational 
time over the baseline method of not exploiting screening. It is interesting to observe 
that there is almost a role-reversal in the performances of GLASSO and SMACS with 
changing A values, for the cases with screening. \j corresponds to a denser solution of 
the precision matrix — here GLASSO converges more slowly than SMACS. For larger 
values of the tuning parameter i.e. A = Xjj, the solutions are sparser — GLASSO 
converges much faster than SMACS. For problems without screening we observe that 
GLASSO converges much faster than SMACS, for both values of the tuning parameter. 
This is probably because of the intensive matrix computations associated with the 
SMACS algorithm. Clearly our proposed strategy makes solving larger problems (pQ), 
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Algorithm Timings (sec) Ratio Time (sec) 



K 


Pi / P 


A 


Algorithm 


with 

screen 


without 
screen 


Speedup 
factor 


graph 
partition 


2 


200 / 400 


A/ 


GLASSO 


11.1 


25.97 


2.33 


0.04 






O lvl r\ v_/ o 


12.31 


137 45 


11.16 






Xtt 


GLASSO 


1.687 


4.783 


2.83 


0.066 






SMACS 


10.01 


42.08 


4.20 


2 


500 /1000 


\l 


GLASSO 


305.24 


735.39 


2.40 


0.247 








1 75 


21 38* 


1 2 21 






Xtt 


GLASSO 


29.8 


121.8 


4.08 


0.35 






SMACS 


272.6 


1247.1 


4.57 


5 


300 /1500 


Xj 


GLASSO 


210.86 


1439 


6.82 


0.18 






63.22 


6062* 


95 88 






Xtt 


GLASSO 


10.47 


293.63 


28.04 


0.123 






SMACS 


219.72 


6061.6 


27.58 


5 


500 /2500 


Xj 


GLASSO 

«M APS 


1386.9 
493 






0.71 








GLASSO 
SMACS 


17.79 
354.81 


963.92 


54.18 


U.Ulo 


8 


300 /2400 


A/ 


GLASSO 
SMACS 


692.25 
185.75 






0.713 






X H 


GLASSO 
SMACS 


9.07 
153.55 


842.7 


92.91 


0.023 



Table 1: Table showing (a) the times in seconds with screening, (b) without screening 
i.e. on the whole matrix and (c) the ratio (b)/(a) - 'Speedup factor' for algorithms 
GLASSO and SMACS. Algorithms with screening are operated serially — the times 
reflect the total time summed across all blocks. The column 'graph partition' lists the 
time for computing the connected components of the thresholded sample covariance 
graph. Since X n > A/, the former gives sparser models. '*' denotes the algorithm did 
not converge within 1000 iterations. '-' refers to cases where the respective algorithms 
failed to converge within 2 hours. 

not only feasible but with quite attractive computational time. The time taken by the 
graph-partitioning step in splitting the thresholded covariance graph into its connected 
components is negligible as compared to the timings for the optimization problem. 
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4.2 Micro-array Data Examples 

The graphical lasso is often used in learning connectivity networks in gene-microarray 
data 



Friedman et al 



20071 . see for example]. Since in most real examples the num- 
ber of genes p is around tens of thousands, obtaining an inverse covariance matrix by 
solving (|T]) is computationally impractical. The covariance thresholding method we 
propose easily applies to these problems — and as we see gracefully delivers solutions 
over a large range of the parameter A. We study three different micro- array examples 
and observe that as one varies A from large to small values, the thresholded covariance 
graph splits into a number of non-trivial connected components of varying sizes. We 
continue till a small/moderate value of A when the maximal size of a connected com- 
ponent gets larger than a predefined machine-capacity or the 'computational budget' 
for a single graphical lasso problem. Note that in relevant micro-array applications, 
since p ^> n (n, the number of samples is at most a few hundred) heavy regulariza- 
tion is required to control the variance of the covariance estimates — so it does seem 
reasonable to restrict to solutions of (CQ) for large values of A. 
Following are the data-sets we used for our experiments: 



(A) This d ata-set appears in lAlon et al.l 1999| and has been analyzed by 



Rothman et al 



2008 



for example]. In this experiment, tissue samples were analyzed using an 
Aftymetrix oligonucleotide array. The data were processed, filtered and reduced 
to a subset of p = 2000 gene expression values. The number of colon adenocar- 
cinoma tissue samples is n = 62. 

(B) This is an early example of an expression array, obtained from the Patrick Brown 
lab at Stanford University. There are n = 385 patient samples of tissue from 
various regions of the body (some from tumors, some not), with gene-expression 
measurements for p = 4718 genes. 

(C) The third example is the by now famo us NKI dataset that produ ced the 70-gene 



prognostic signature for breast cancer Ivan de Vijver et al.l 2002| . Here there are 
n = 295 samples and p = 24481 genes. 

Among the above, both (B) and (C) have few missing values — which we imputed by 
the respective global means of the observed expression values. For each of the three 
data-sets, we took S to be the corresponding sample correlation matrix. 

Figure [1] shows how the component sizes of the thresholded covariance graph change 
across A. We describe the strategy we used to arrive at the figure. Note that the 
connected components change only at the absolute values of the entries of S. From the 
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sorted absolute values of the off-diagonal entries of S, we obtained the smallest value 
of A, say A^ in , for which the size of the maximal connected component was 1500. For a 
grid of values of A till A' min , we computed the connected components of the thresholded 
sample- covariance matrix and obtained the size-distribution of the various connected 
components. Figure [TJ shows how these components change over a range of values 
of A for the three examples (A), (B) and (C). The number of connected components 
of a particular size is denoted by a color-scheme, described by the color-bar in the 
figures. With increasing A: the larger connected components gradually disappear 
as they decompose into smaller components; the sizes of the connected components 
decrease and the frequency of the smaller components increase. Since these are all 
correlation matrices, for A > 1 all the nodes in the graph become isolated. The 
range of A values for which the maximal size of the components is smaller than 1500 
differ across the three examples. For (C) there is a greater variety in the sizes of the 
components as compared to (A) and (B). Note that by Theorem [TJ the pattern of the 
components appearing in Figure [TJ are exactly the same as the components appearing 
in the solution of ([TJ for that A. 

(B) p=4718 
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Figure 1: Figure showing the size distribution (in the log-scale) of connected compo- 
nents arising from the thresholded sample covariance graph for examples (A)-(C). For 
every value of A (vertical axis), the horizontal slice denotes the sizes of the different 
components appearing in the thresholded covariance graph. The colors represent the 
number of components in the graph having that specific size. For every figure, the 
range of A values is chosen such that the maximal size of the connected components 
do not exceed 1500. 



Continuing from Table [TJ we proceed to show that the screening rules lead to 
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Average size 




Algorithm Timings (sec) 


Ratio 


Time (sec) 


of maximal 


Algorithm 


with 


without 


Speedup 


graph 


component 




screen 


screen 


factor 


partition 


5 


GLASSO 


0.02 


3866 


1.9 x 10 5 


0.009 




SMACS 


0.87 


1.16 x 10 5 


1.33 x 10 5 


727 


GLASSO 
SMACS 


413 
4285 


13214 
2.7 x 10 5 


32 
63 


0.14 



Table 2: Timings for Eg (A): table showing times with/without screening, 'Speedup 
factor' and time for 'graph partition' as in Tabled! for two different ranges of A- values. 
Here p = 2000 and the times for each of the two columns are summed over 10 different 
A values. The left-most column is the size of the maximal connected component, 
averaged across the A values. We see that the times increase with decreasing sparsity 
and the speed-up factor is impressively large when there are a large number of small- 
sized connected components. The cost of computing the components of the thresholded 
covariance graph is relatively negligible. 

encouraging speed-ups for real-data examples as well. We consider example (A) and 
apply on it GLASSO and SMACS with and without screening on a grid of A values. 
For smaller values of A in the range, GLASSO and SMACS take a very long time to 
converge. Comparative timings appear in Table [2J In these experiments we took the 
criterion of convergence for both GLASSO and SMACS as 10~ 4 and they were run till 
a maximum of 500 iterations. The algorithms were run independently across the grid 
of A values chosen. The times displayed for the algorithms with screening indicate the 
total time required by solving the smaller sub-problems serially — the results shown 
are to emphasize the speed-ups obtained in each of the algorithms via screening. 

For examples (B) and (C) the full problem sizes are beyond the scope of GLASSO 
and SMACS — the screening rule is apparently the only way to obtain solutions for a 
reasonable range of A- values as shown in Figure [TJ We report in Table [3] the averaged 
time taken by each of GLASSO and SMACS over a grid of 100 A-values, for examples 
(B) and (C). The 100 A values correspond to the top 2 % sorted absolute values of the 
off-diagonal entries in S below A 50 o — where A 50 o is the smallest value of A such that 
the maximal component in the thresholded covariance graph has size 500. 

5 Conclusions 

In this paper we present a novel property characterizing the family of solutions to the 
graphical lasso problem (pQ), as a function of the regularization parameter A. The prop- 
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Example / p 


Average size of 
maximal component 


Algorithm Timings (sec) 
GLASSO SMACS 


Time(sec) 
graph-partition 


(B) / 4718 


330 


4.93 


31.09 


0.0082 


(C) / 24481 


461 


15.4 


141 


0.0186 



Table 3: Averaged timings (in sees) for different algorithms for examples (B) and (C) 
over a grid of 100 A values, as described in the text. Algorithms are applied with the 
screening rule. 



erty is fairly surprising — the vertex partition induced by the connected components 
of the non-zero patterns of the estimated concentration matrix and the thresholded 
sample covariance matrix S are exactly equal. This property seems to have been un- 
observed in the literature. Our observation not only provides interesting insights into 
the properties of the graphical lasso solution-path but also opens the door to solving 
large-scale graphical lasso problems, which are otherwise intractable. This simple rule 
when used as a wrapper around existing algorithms leads to enormous performance 
boosts — on occasions by a factor of thousands! 



A Proofs 

A.l Proof of Theorem [1] 

Proof. Suppose (we suppress the superscript A for nota tional convenience) solves 



probl em (Op), then standard KKT conditions of optimality Boyd and Vandenberghd . 



2004j give: 



- Wy | < A V Qij = 0; and (11) 
Wij = Sij + A Vey>0; Wij = Sy - A V y < 0; (12) 

where W = (0) _1 . The diagonal entries satisfy W# = Su + A, for i = 1, . . . ,p. 

Using (jlj) and (jHJ), there exists an ordering of the vertices {1, . . . ,p} of the graph 
such that E' A ' is block-diagonal. For notational convenience, we will assume that the 
matrix is already in that order. Under this ordering of the vertices, the edge-matrix 
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of the thresholded covariance graph is of the form: 



/E[ x) 



E (A) 



E 



o \ 



E i(V 



(13) 



where the different components represent blocks of indices given by: Vf"\l = 1, . . . , k(X). 

We will construct a matrix W having the same structure as ( TTBl) which is a solution 
to ([1]). Note that if W is block diagonal then so is its inverse. Let W and its inverse 
be given by: 



W 



/Wi ••■ 

W 2 



\ 



e 



f&t o ••• o \ 

2 



o w k(x) J 



(14) 



0fc(A)/ 



Define the block diagonal matrices or equivalently ®e via the following sub- 
problems 



£ = argmin {-logdet(0*) + tr(S e @ e ) + (®/)« 1} 



(15) 



for i = 1, ...,k(X), where is a sub-block of S, with row/column indices from 
Vf ) x yj A) . The same notation is used for 0^. Denote the inverses of the block- 
precision matrices by {©^j -1 = W^. We will show that the above satisfies the 
KKT conditions — (fTTj) and (fT2"|). 

Note that by construction of the thresholded sample covariance graph, 
if i G V £ (A) and j G Vj, A) with t ^ £' , then |S -| < A. 

Hence, for i G Vf^ and j G with i ^ £'; the choice 0^- = = satisfies the 
KKT conditions fllT) 

\Sij-Wijl < A 

for all the off-diagonal entries in the block- matrix (fTB"]) . 

By construction ffl~5"l) it is easy to see that for every £, the matrix 0^ satisfies the 
KKT conditions fllip and (1121) corresponding to the £ th block of the p x p dimensional 
problem. Hence solves problem (TTJ. 

The above argument shows that the connected components obtained from the 
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estimated precision graph leads to a partition of the vertices {V^}i<e< K (X) such 
that for every t G {1, . . . , fc(A)}, there is a f G {1, . . . , «(A)} such that 9j, A) C Vf } . In 
particular fc(A) < k(X). 

Conversely, if @ admits the decomposition as in the statement of the theorem, 
then it follows from (ITTj) that: 

for % G 9j A) and j G 9j, A) with £ ^ f ; |Sy - Wy| < A. Since W {j = 0, we have 
\Sij\ < A. This proves that the connected components of G^ leads to a partition of 
the vertices, which is finer than the vertex-partition induced by the components of 
Q^ x \ In particular this implies that k(X) > k(X). 

Combining the above two we conclude k(X) = k(\) and also the equality (jSJ). The 
permutation 7r in the theorem appears since the labeling of the connected components 
is not unique. □ 

A. 2 Proof of Theorem H 

Proof. This proof is a direct consequence of Theorem [Tj which establishes that the 
vertex-partitions induced by the the connected components of the estimated precision 
graph and the thresholded sample covariance graph are equal. 

Observe that, by construction, the connected components of the thresholded sam- 
ple covariance graph i.e. are nested within the connected components of G^- x '\ In 
particular, the vertex-partition induced by the components of the thresholded sample 
covariance graph at A, is contained inside the vertex-partition induced by the com- 
ponents of the thresholded sample covariance graph at A'. Now, using Theorem [1] we 
conclude that the vertex-partition induced by the components of the estimated preci- 
sion graph at A, given by {V^}i<e< K (\) is contained inside the vertex-partition induced 

— (AM i 

by the components of the estimated precision graph at A', given by {V^ }i<£< k (a')- 
The proof is thus complete. □ 
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